This study investigates fine particulate matter (PM 2.5) air quality data alongside four Social Vulnerability Index (SVI) metrics across North Carolina counties and visually compares them against the locations of retired and operating power plants. Environmental justice is a persistent concern in America, and with the sudden increase in electricity consumption from the rise of data centers, it is as important as ever to make sure injustices are not overlooked. The environmental justice movement traces its origins to Warren County, North Carolina, where an African American community was chosen as the site of a hazardous waste landfill, sparking national conversations about systemic environmental inequities. This historical context is the reason North Carolina was chosen as the area of focus for this research.
The study focuses on coal-burning power plants in particular because they are a major source of air pollution, specifically the very dangerous pollutant PM 2.5. PM 2.5 from coal combustion is rich in sulfur dioxide, black carbon, and metals, which can enter the lungs and bloodstream and lead to conditions such as cancer, asthma, and even premature death. A study reported by the NIH estimated that for every 1 μg/m³ increase in coal PM 2.5, mortality in the studied regions increased by 1.12%. Given North Carolina’s unsettling environmental justice history, we seek to explore the connections between coal-burning power plants, amounts of PM 2.5 air pollution, and social vulnerability indices to potentially reveal disparities in air quality and public health impacts.
Is there a correlation between PM 2.5 concentrations in North Carolina and the proximity of coal-burning power plants?
Do counties with higher Social Vulnerability Indexes (SVI) have higher concentrations of PM 2.5?
Are there more coal-burning power plants in these counties?
Are there more clean energy power plants in lower SVI counties?
Is there a three-way relationship between coal-burning power plants, SVI, and PM 2.5 concentrations in North Carolina?
This research draws on five datasets: Power Plants, North Carolina Retired Generators, Social Vulnerability Index (SVI), Particulate Matter 2.5 Air Quality, and North Carolina Counties.
The Power Plant dataset was collected from the open data site of the Geospatial Management Office of the U.S. Department of Homeland Security. The shapefile was created for the Homeland Infrastructure Foundation-Level Database and the energy modeling community at large. The data covers electric power plants across the United States, including the following plant types: hydroelectric dams, fossil fuels (coal, natural gas, or oil), nuclear, solar, wind, geothermal, and biomass. The main fields used in this study are plant name, state, status (operating or retired), primary fuel, and geographic location.
The GeoJSON option for this dataset was copied into RStudio to import the data. The scope was then narrowed to plants located in North Carolina, and the primary fuel field was used to filter out plants that would not produce PM 2.5. According to the EPA’s eGRID code lookup, the primary fuel abbreviations found in this dataset represent the following: BIT (bituminous coal), AB (agricultural byproduct), BLQ (black liquor), DFO (distillate fuel oil, light fuel oil, FO2, diesel oil), LFG (landfill gas), NG (natural gas), OBG (digester gas, methane, and other biomass gases), SLW (sludge waste), WDS (wood, wood waste solid), WH (waste heat), SUN (solar), WND (wind), WAT (water), MWH (electricity), and NUC (nuclear).
These plants were grouped into four categories: BIT-only plants; other relatively moderate PM-producing plants (WDS, SLW, BLQ, AB, and DFO); very low PM-producing plants (LFG, NG, OBG, and WH); and a combination of the BIT plants and the moderate PM-producing plants. This last data frame was filtered to plants whose status was either operating or retired. A second dataset was then merged into the retired plants data frame to add a retirement year column. This dataset comes from the U.S. Energy Information Administration (https://www.eia.gov/electricity/data/eia860m/) and contains all generators retired in North Carolina as of October 2024. The merge matched PLANT_CODE in the power plant data to Plant.ID in the retired generators data.
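The filtering and merge steps described above can be sketched in base R as follows. This is a minimal illustration on a hypothetical toy table, not the code used in the study: the plant rows, the "Operating"/"Retired" status labels, and the retirement year are invented for the example, and the retired generators table is assumed to hold one row per plant.

```r
# Hypothetical toy version of the power plant table; only the column names
# follow the variable tables in this report.
plants <- data.frame(
  PLANT_CODE = c("101", "102", "103", "104", "105"),
  NAME       = c("Plant A", "Plant B", "Plant C", "Plant D", "Plant E"),
  STATE      = c("NC", "NC", "NC", "SC", "NC"),
  STATUS     = c("Operating", "Retired", "Operating", "Operating", "Retired"),
  PRIM_FUEL  = c("BIT", "BIT", "WDS", "BIT", "SUN"),
  stringsAsFactors = FALSE
)

# Fuel codes treated as moderate PM producers in the text.
moderate_fuels <- c("WDS", "SLW", "BLQ", "AB", "DFO")

# Keep North Carolina plants, then keep BIT plus the moderate PM fuels.
nc_plants        <- subset(plants, STATE == "NC")
bit_and_moderate <- subset(nc_plants, PRIM_FUEL %in% c("BIT", moderate_fuels))
retired_plants   <- subset(bit_and_moderate, STATUS == "Retired")

# Hypothetical retired generators table (assumed one row per plant).
retired_generators <- data.frame(
  Plant.Name      = "Plant B",
  Plant.ID        = 102L,
  Retirement.Year = 2020L,
  stringsAsFactors = FALSE
)

# PLANT_CODE is character and Plant.ID is integer, so align the key types.
retired_generators$Plant.ID <- as.character(retired_generators$Plant.ID)

# Left join: every retired plant is kept, with its retirement year if found.
retired_with_year <- merge(
  retired_plants,
  retired_generators[, c("Plant.ID", "Retirement.Year")],
  by.x = "PLANT_CODE", by.y = "Plant.ID", all.x = TRUE
)
print(retired_with_year)
```

Using `all.x = TRUE` keeps retired plants that have no match in the generator list, which makes unmatched plant codes visible as missing retirement years rather than silently dropping them.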
The Particulate Matter (PM2.5) data set was collected from the United States Environmental Protection Agency website’s Outdoor Air Quality Data section (https://www.epa.gov/outdoor-air-quality-data). Using its “Download Daily Data” tool, we queried daily air quality summary statistics for the criteria pollutant PM2.5 by monitor in North Carolina for the years 2013 to 2022. The dataset includes values for Daily Mean PM2.5 Concentration, Daily AQI Value, and Daily Observation Count.
Since we are aiming to examine the relationship between PM2.5 pollution, social vulnerability indices, and power plants, we focused on the Daily Mean PM2.5 Concentration values and did not consider Daily AQI Value or Daily Observation Count. We imported the EPA’s daily PM2.5 datasets for each year from 2013 to 2022 as data frames, then used group_by() on only the columns needed to visualize and map each year’s PM2.5 data by county in North Carolina and converted the results into spatial data frames. The grouping columns were Site ID, County, Site Latitude, and Site Longitude. Two new columns, meanPM and maxPM, were created with summarize(); they hold the average and the maximum, respectively, of the Daily Mean PM2.5 Concentration for 2013 to 2022.
The North Carolina Counties dataset was added primarily to supply spatial information for the datasets above. It links the power plant and SVI data to county boundaries, which made it possible to create maps of social vulnerability and power plant locations across North Carolina counties and to identify spatial patterns.
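The group_by() and summarize() step can be sketched as follows. This is a minimal example on hypothetical daily readings, assuming the dplyr package is installed; the site IDs, coordinates, and concentrations are invented, and only the column names follow the EPA download described above.

```r
library(dplyr)

# Hypothetical daily readings for two monitors (values are illustrative only).
daily_pm <- data.frame(
  `Site ID`        = c(1, 1, 1, 2, 2),
  County           = c("Wake", "Wake", "Wake", "Durham", "Durham"),
  `Site Latitude`  = c(35.8, 35.8, 35.8, 36.0, 36.0),
  `Site Longitude` = c(-78.6, -78.6, -78.6, -78.9, -78.9),
  `Daily Mean PM2.5 Concentration` = c(8, 10, 12, 5, 7),
  check.names = FALSE  # keep the original column names with spaces
)

# One row per monitor, with the mean and maximum of the daily means.
site_summary <- daily_pm %>%
  group_by(`Site ID`, County, `Site Latitude`, `Site Longitude`) %>%
  summarize(
    meanPM = mean(`Daily Mean PM2.5 Concentration`, na.rm = TRUE),
    maxPM  = max(`Daily Mean PM2.5 Concentration`, na.rm = TRUE),
    .groups = "drop"
  )
print(site_summary)
```

Keeping the latitude and longitude in the grouping columns means each summarized row still carries the coordinates needed to turn it into a spatial object afterwards, for example with `sf::st_as_sf()` and its `coords` argument.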

Power Plants dataset variables:

| Variable | Description | Type |
|---|---|---|
| PLANT_CODE | Power Plant Code ID | Character |
| NAME | Name of Power Plant | Character |
| STATE | State Plant is Located | Character |
| STATUS | Operating Status of Plant | Character |
| COUNTY | County Plant is Located | Character |
| COUNTYFIPS | County FIPS | Character |
| PRIM_FUEL | Primary Fuel of Plant | Character |
| LATITUDE | Latitude of Plant | Double-precision decimal number |
| LONGITUDE | Longitude of Plant | Double-precision decimal number |

SVI dataset variables:

| Variable | Description | Type |
|---|---|---|
| FIPS | County FIPS Code | Character |
| STATE | State | Character |
| RPL_THEME1 | Theme 1 Percentile Ranking | Double-precision decimal number |
| RPL_THEME2 | Theme 2 Percentile Ranking | Double-precision decimal number |
| RPL_THEME3 | Theme 3 Percentile Ranking | Double-precision decimal number |
| RPL_THEME4 | Theme 4 Percentile Ranking | Double-precision decimal number |
| RPL_THEMES | Overall Summary Ranking Variable | Double-precision decimal number |

PM2.5 dataset variables:

| Variable | Description | Type |
|---|---|---|
| Date | Date Data was Taken | Character |
| Site ID | ID Number of Site Location | Number |
| Daily Mean PM2.5 Concentration | Daily Mean of PM 2.5 Concentration | Number (μg/m³) |
| Local Site Name | Name of Site | Character |
| State | State of Site | Character |
| County | County of Site | Character |
| Site Latitude | Site Latitude | Double-precision decimal number |
| Site Longitude | Site Longitude | Double-precision decimal number |

North Carolina Retired Generators dataset variables:

| Variable | Description | Type |
|---|---|---|
| Plant.Name | Name of Plant | Character |
| Plant.ID | Plant ID | Integer |
| Retirement.Year | Year Plant Retired | Integer |

North Carolina Counties dataset variables:

| Variable | Description | Type |
|---|---|---|
| STATEFP | State Code, 37 for NC | Character |
| COUNTY | County Name | Character |
| geometry | Geometry | sfc_MULTIPOLYGON |

Figure 6.1 is a map of North Carolina Coal-Burning Power Plants with BIT Primary Fuel Source.
Figure 6.2 is a map of North Carolina Power Plants including BIT plants and 5 other Primary Fuel Types (WDS, SLW, BLQ, AB, DFO).
Figure 6.3 is a map of Operating vs. Retired North Carolina Power Plants. As shown, all of the retired plants are BIT primary fuel plants, suggesting that the highest PM-emitting plants are being retired ahead of other types.
Figure 6.4 is an interactive map of Mean PM 2.5 data, SVI Index, and Power Plants.
These maps provide a visualization of PM2.5 concentration levels and SVI percentiles across North Carolina over several two-year time periods (2013–2014, 2015–2016, 2017–2018, 2019–2020, and 2021–2022). The maps highlight distinct regional trends in both air quality and social vulnerability, illustrating how these factors evolve over time and interact geographically.
In general, the PM2.5 concentration levels demonstrate a statewide downward trend, with lower concentrations observed in later years compared to earlier periods. However, the western part of the state consistently shows areas with higher PM2.5 concentrations, particularly near counties where operating power plants and other industrial facilities are located. For instance, counties near active BLQ and DFO units in western North Carolina display elevated PM2.5 levels in the earlier years, which may reflect their proximity to emissions sources.
The SVI percentiles reveal clusters of higher social vulnerability in the eastern part of the state, particularly in counties where PM2.5 monitors are sparse. These areas with higher social vulnerability often overlap with historically underserved communities that may be at greater risk due to a lack of air quality monitoring and mitigation measures. Some western counties, such as Jackson and Swain, show increasing SVI percentiles alongside worsening PM2.5 levels in certain periods. Counties like Cleveland and Mecklenburg reflect notable changes over time. Cleveland County, home to a plant in Shelby, shows increased social vulnerability after 2014, coinciding with higher environmental risks. Mecklenburg County exhibits worsening PM2.5 levels and an increased SVI percentile by 2022. Montgomery County and Halifax County also experience slight increases in PM2.5 levels and SVI values, likely influenced by the presence of nearby industrial units. Interestingly, Hyde County in the eastern region exhibits a rising SVI percentile despite a general reduction in PM2.5 levels over the years, suggesting that social factors driving vulnerability may not be directly linked to air quality improvements. The maps also highlight scale adjustments in PM2.5 measurements, with generally lower concentration levels in later years, reflecting statewide efforts to reduce particulate pollution.
These findings underscore the complex relationship between air quality and social vulnerability, highlighting the need for targeted interventions in areas facing persistent social and environmental challenges. While the western part of the state, characterized by lower SVI values, appears to have higher PM2.5 concentrations, it is crucial to consider the lack of PM2.5 monitors in the eastern regions where SVI tends to be higher. This monitoring gap may obscure evidence of potentially rising PM2.5 concentrations in areas with higher social vulnerability, underscoring the need for improved air quality surveillance in underserved regions.
Figure 7.1: Correlation Ellipse Plot
Figure 7.1 Analysis of Correlation Plot:
In the Correlation Ellipse Plot, we examine pairwise correlations using (i) the shape of each ellipse, (ii) its direction or slope, and (iii) the intensity of its color. According to our correlation plot, the strongest correlations across 2014 to 2022 appear for the following variable pairs: (a) the aggregate of all four SVI themes and SVI theme 1, and (b) the aggregate of all four SVI themes and SVI theme 4, since the narrower ellipse shapes and darker ellipse colors suggest a very high correlation for these pairs. This makes sense, since each theme is one of the four components that make up the SVI aggregate. The ellipse shapes and color intensities also suggest strong correlations between each of the other SVI themes and the SVI aggregate, but these are not as strong as the two pairs mentioned first.
Figure 7.2: Mixed Correlation Plot
Figure 7.2 Further correlation plot analysis:
When looking at the mixed correlation plot, which combines the ellipses with the exact correlation values, we can confirm our prior observations when we see the correlation values for (a) the SVI aggregate and SVI theme 1 and (b) the SVI aggregate and SVI theme 4 are 0.85 and 0.78, respectively. Unfortunately, since the aim of the study is to examine the relationship, if any, between PM2.5 concentration, number of power plants, and SVI indices by county in North Carolina, the results from the correlation plots do not tell us much, if anything, about such a relationship.
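The values shown in a correlation plot come from a pairwise correlation matrix, which can be sketched in base R as below. The data here is simulated purely for illustration: the theme values are random, the aggregate is a simple row mean (a simplifying assumption, not the CDC's ranking method), and the missing meanPM values only mimic counties without monitors.

```r
set.seed(1)
n <- 50

# Simulated stand-ins for the four SVI theme percentiles.
svi <- data.frame(
  theme1 = runif(n),
  theme2 = runif(n),
  theme3 = runif(n),
  theme4 = runif(n)
)

# Assumption for this sketch only: the aggregate is the mean of the themes.
svi$aggregate <- rowMeans(svi[, c("theme1", "theme2", "theme3", "theme4")])

# Simulated PM2.5 means, with some counties lacking monitor data.
svi$meanPM <- rnorm(n, mean = 7, sd = 1.5)
svi$meanPM[1:10] <- NA

# Pairwise-complete correlations: each pair uses the rows where both exist.
cor_mat <- cor(svi, use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

Because the aggregate is built from the themes, it correlates with each of them by construction, which is why strong theme-to-aggregate correlations say little about the relationship between SVI and PM2.5.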
We used AIC to select variables through stepwise regression, which either adds explanatory variables from the bottom up or removes them from a full set of suggested options. The smaller the AIC value, the better. Because step() is called here on the full model with no scope argument, it performs backward elimination.
step(multi.lin.reg)
## Start: AIC=49.31
## meanPM ~ theme1_value + theme2_value + theme3_value + theme4_value +
## Number_of_Plants
##
## Df Sum of Sq RSS AIC
## - theme3_value 1 0.6500 111.01 47.654
## <none> 110.36 49.313
## - theme4_value 1 3.9605 114.32 49.358
## - Number_of_Plants 1 4.3537 114.72 49.557
## - theme2_value 1 7.7762 118.14 51.262
## - theme1_value 1 13.3570 123.72 53.939
##
## Step: AIC=47.65
## meanPM ~ theme1_value + theme2_value + theme4_value + Number_of_Plants
##
## Df Sum of Sq RSS AIC
## <none> 111.01 47.654
## - Number_of_Plants 1 4.7465 115.76 48.082
## - theme2_value 1 8.0459 119.06 49.712
## - theme4_value 1 10.5679 121.58 50.928
## - theme1_value 1 13.1739 124.19 52.158
##
## Call:
## lm(formula = meanPM ~ theme1_value + theme2_value + theme4_value +
## Number_of_Plants, data = combined_data)
##
## Coefficients:
## (Intercept) theme1_value theme2_value theme4_value
## 6.3963 -2.6181 1.7813 1.8738
## Number_of_Plants
## 0.1641
summary(multi.lin.reg)
##
## Call:
## lm(formula = meanPM ~ theme1_value + theme2_value + theme3_value +
## theme4_value + Number_of_Plants, data = combined_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3413 -0.8170 0.0277 0.8120 2.6377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.3234 0.6025 10.496 1.9e-14 ***
## theme1_value -2.6377 1.0514 -2.509 0.0153 *
## theme2_value 1.7538 0.9162 1.914 0.0611 .
## theme3_value 0.5452 0.9852 0.553 0.5824
## theme4_value 1.4891 1.0901 1.366 0.1778
## Number_of_Plants 0.1579 0.1103 1.432 0.1581
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.457 on 52 degrees of freedom
## (70 observations deleted due to missingness)
## Multiple R-squared: 0.2195, Adjusted R-squared: 0.1445
## F-statistic: 2.925 on 5 and 52 DF, p-value: 0.02115
Analysis of multiple linear regression/AIC:
The AIC analysis and stepwise regression process aimed to identify the best-fitting linear regression model for predicting meanPM by minimizing the AIC value. The initial model included theme1_value, theme2_value, theme3_value, theme4_value, and Number_of_Plants as predictors, with an AIC of 49.31. In the first step, removing theme3_value was the only change that lowered the AIC, improving it to 47.65. This indicates that theme3_value (minority status and language) contributed the least of the four themes to explaining variance in meanPM. In the second step, no further removal lowered the AIC, so the final model retained theme1_value, theme2_value, theme4_value, and Number_of_Plants as predictors, with an AIC of 47.65.
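As a check on the AIC values reported by step(), the short sketch below recomputes them from the residual sums of squares in the output. For linear models, step() relies on extractAIC(), which uses n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The helper name aic_lm is hypothetical, and the sample size of 58 is inferred from the 52 residual degrees of freedom plus 6 coefficients in the full model.

```r
# AIC as computed by extractAIC() for lm fits (up to an additive constant).
# aic_lm is a hypothetical helper name used only in this sketch.
aic_lm <- function(rss, n, k) {
  n * log(rss / n) + 2 * k
}

# 52 residual degrees of freedom + 6 coefficients in the full model.
n_obs <- 52 + 6

# Full model: intercept + 5 predictors, RSS = 110.36.
aic_full <- aic_lm(rss = 110.36, n = n_obs, k = 6)

# Reduced model without theme3_value: intercept + 4 predictors, RSS = 111.01.
aic_reduced <- aic_lm(rss = 111.01, n = n_obs, k = 5)

print(round(c(full = aic_full, reduced = aic_reduced), 2))
```

The two results agree with the reported values of 49.31 and 47.65 to within rounding of the printed RSS. Dropping theme3_value raises the RSS only slightly while saving one parameter, which is why the penalized criterion improves.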
The summary of the full multiple linear regression model (before theme3_value was removed) provides more insight into the relationship between meanPM and the potential predictors. The model was fit on 58 observations, since 70 observations were dropped for missing values. The model’s intercept is statistically significant (p < 0.001), indicating a baseline meanPM of about 6.32 μg/m³ when all predictors are zero. Among the predictors, theme1_value is statistically significant with a coefficient of -2.6377 (p = 0.0153). This negative relationship suggests that higher values of this index may be associated with lower PM2.5 concentrations. Theme2_value shows a positive coefficient of 1.7538 and approaches significance (p = 0.0611), suggesting a potential positive association with PM2.5 concentrations, though further analysis or data may be needed to confirm this. The remaining predictors (theme 3, theme 4, and number of plants) were not statistically significant, as their p-values exceeded the 0.05 threshold, suggesting their relationship with meanPM is weaker, less consistent, or even non-existent in the model. The model’s multiple R-squared of 0.2195 shows that the predictors collectively explain approximately 21.95% of the variance in meanPM; the adjusted R-squared, which penalizes the number of predictors, is 0.1445. These relatively low values suggest there are additional factors influencing PM2.5 concentrations that are not captured by the model. The F-statistic (2.925, p = 0.02115) indicates that the model, as a whole, is statistically significant, meaning the predictors contribute to explaining variations in PM2.5 concentrations to some extent.
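The relationship between the two R-squared values and the F-statistic in the summary output can be verified with a few lines of arithmetic. This is a worked check using only the numbers printed above (multiple R-squared 0.2195, 5 predictors, 52 residual degrees of freedom); the variable names are chosen for this sketch.

```r
r2 <- 0.2195   # multiple R-squared from the summary output
p  <- 5        # number of predictors in the full model
df <- 52       # residual degrees of freedom
n  <- df + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_stat <- (r2 / p) / ((1 - r2) / df)

# Upper-tail probability of the F distribution gives the model p-value.
p_value <- pf(f_stat, df1 = p, df2 = df, lower.tail = FALSE)

print(round(c(adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value), 4))
```

The adjusted value comes out near 0.1445 and the F-statistic near 2.925, matching the summary. The gap between 0.2195 and 0.1445 is fairly large because five predictors are being estimated from only 58 observations.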
Figure 7.3: PM 2.5 Data vs Number of Plants Plot
Figure 7.4: PM 2.5 Data vs SVI Aggregate Plot
Figure 7.3 and Figure 7.4:
Analysis of scatterplots: After plotting (i) mean PM2.5 against the number of power plants and (ii) mean PM2.5 against the SVI aggregate for 2014 to 2022, the resulting scatterplots suggest that PM2.5 has little to no relationship with either the number of power plants or the SVI aggregate over this period.
“Environmental Justice History.” U.S. Department of Energy, www.energy.gov/lm/environmental-justice-history.
Doctrow, Brian. “Deaths associated with pollution from coal power plants.” National Institutes of Health, 12 Dec. 2023, www.nih.gov/news-events/nih-research-matters/deaths-associated-pollution-coal-power-plants.
“eGRID.” Environmental Protection Agency, www.epa.gov/egrid/code-lookup.
5.2 Social Vulnerability Index:
The CDC’s Social Vulnerability Index (SVI) dataset provides biennial data on indices that measure social vulnerability. These indices are organized into four key themes, each representing the average of various indicators: Theme 1 (Socioeconomic Status), Theme 2 (Household Composition and Disability), Theme 3 (Minority Status and Language), and Theme 4 (Housing Type and Transportation). Additionally, the dataset includes an overall summary ranking variable, which aggregates the values from these themes to provide a comprehensive measure of social vulnerability.
Geodatabase data was pulled from the CDC website (https://www.atsdr.cdc.gov/place-health/php/svi/svi-data-documentation-download.html) in order to retain spatial information for mapping. There was an issue with importing the 2014 geodatabase (GDB) data, so the 2014 data was imported as a CSV file instead. The data was filtered to include only counties in North Carolina (using STATEFP == 37). The NC Counties spatial data frame was then joined to the 2014 SVI data, matching the GEOID column from the spatial data to the FIPS column from the SVI data. SVI data for North Carolina counties from 2016, 2018, 2020, and 2022 were read as geodatabases, since these imported directly from the website without issue. To support GLM and linear regression analysis, additional non-spatial data frames were created by using st_drop_geometry() to drop the spatial component of the SVI data.
Next, the code focuses on creating consolidated datasets for analysis. Four separate themes (Theme 1, Theme 2, Theme 3, and Theme 4) and an aggregate measure (RPL_THEMES) are processed by selecting the relevant theme columns from each dataset (2014, 2016, 2018, 2020, and 2022) and performing inner joins using FIPS as the key. The resulting data frames provide a longitudinal view of each theme’s values across the years for each county. Column names are updated to reflect the year associated with each column. Finally, the code consolidates the themes and aggregate measures into respective data frames for further analysis, preparing the data for generalized linear model (GLM) analysis or other statistical approaches.
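The consolidation step for a single theme can be sketched in base R as a chain of inner joins on FIPS. This is an illustrative version with hypothetical values, not the study's code: the helper name theme_by_year is invented, and the columns are renamed before joining (rather than after, as described above) so that merge() does not have to add .x and .y suffixes.

```r
# Hypothetical per-year SVI tables; only FIPS and RPL_THEME1 follow the source.
svi_2014 <- data.frame(FIPS = c("37001", "37003", "37005"),
                       RPL_THEME1 = c(0.42, 0.35, 0.61),
                       stringsAsFactors = FALSE)
svi_2016 <- data.frame(FIPS = c("37001", "37003", "37005"),
                       RPL_THEME1 = c(0.45, 0.33, 0.58),
                       stringsAsFactors = FALSE)
svi_2018 <- data.frame(FIPS = c("37001", "37003"),
                       RPL_THEME1 = c(0.47, 0.30),
                       stringsAsFactors = FALSE)

# Hypothetical helper: keep FIPS plus one theme column, tagged with its year.
theme_by_year <- function(data, year) {
  out <- data[, c("FIPS", "RPL_THEME1")]
  names(out)[2] <- paste0("theme1_", year)
  out
}

yearly  <- list(`2014` = svi_2014, `2016` = svi_2016, `2018` = svi_2018)
renamed <- Map(theme_by_year, yearly, names(yearly))

# merge() defaults to an inner join, so only counties present in every
# year survive the chain of joins.
theme1_all <- Reduce(function(a, b) merge(a, b, by = "FIPS"), renamed)
print(theme1_all)
```

An inner join drops any county missing from one of the years, as county 37005 is here, so it is worth comparing the row count of the result against the expected number of counties.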